Fast  Multidimensional  Density  Estimation 
based  on  Random-width  Bins 


Leonard  B.  Heame 
and 

Edward  J.  Wegman 


Technical  Report  No.  109 
October,  1994 


Center  for 

Computational 

Statistics 


^9941201  025 


George  Mason  University 
Fairfax,  VA  22030 


CENTER  FOR  COMPUTATIONAL  STATISTICS 

TECHNICAL  REPORT  SERIES  (RECENT  REPORTS) 

TR  93.  Winston  C.  Chow,  Modeling  and  Estimation  with  Fractional  Brownian  Motion  and  Fractional 
Gaussian  Noise  (Ph.D.  Dissertation),  February,  1994. 

TR  94.  Mark  C.  Sullivan  and  Edward  J.  Wegman,  Correlation  Estimators  Based  on  Simple  Nonlinear 
Transformations,  February,  1994,  To  appear  IEEE  Transactions  on  Signal  Processing. 

TR  95.  Mark  C.  Sullivan  and  Edward  J.  Wegman,  Normalized  Correlation  Estimators  Based  on 
Simple  Nonlinear  Transformations,  March,  1994. 

TR  96.  Kathleen  Perez-Lopez  and  Arun  Sood,  Comparison  of  Subband  Features  for  Automatic 
Indexing  of  Scientific  Image  Databases,  March,  1994. 

TR  97.  Wendy  L.  Poston  and  Jeffrey  L.  Solka,  A  Parallel  Method  to  Maximize  the  Fisher  Information 
Matrix,  June,  1994. 

TR  98.  Edward  J.  Wegman  and  Charles  A.  Jones,  Simulating  a  Multi-target  Acoustic  Array  on  the 
Intel  Paragon,  June,  1994. 

TR  99.  Barnabas  Takacs,  Edward  J.  Wegman  and  Harry  Wechsler,  Parallel  Simulation  of  an  Active 
Vision  Model,  June,  1994. 

TR  100.  Edward  J.  Wegman  and  Qiang  Luo,  Visualizing  Densities,  October,  1994. 

TR  101.  Daniel  B.  Carr,  Converting  Tables  to  Plots,  October,  1994. 

TR  102.  Julia  Corbin  Fauntleroy  and  Edward  J.  Wegman,  Parallelizing  Locally- Weighted  Regression, 
October,  1994. 

TR  103.  Daniel  B.  Carr,  Color  Perception,  the  Importance  of  Gray  and  Residuals  on  a  Choropleth 
Map,  October,  1994. 

TR  104.  David  J.  Marchette,  Carey  E.  Priebe,  George  W.  Rogers  and  Jeffrey  L.  Solka,  Filtered  Kernel 
Density  Estimation,  October,  1994. 

TR  105.  Jeffrey  L.  Solka,  Edward  J.  Wegman,  Carey  E.  Priebe,  Wendy  L.  Poston  and  George  W. 
Rogers,  A  Method  to  Determine  the  Structure  of  an  Unknown  Mixture  Using  the  Akaike  Information 
Criterion  and  the  Bootstrap,  October,  1994. 

TR  106.  Wendy  L.  Poston,  Edward  J.  Wegman,  Carey  E.  Priebe  and  Jeffrey  L.  Solka,  A  Contribution 
to  the  Theory  of  Robust  Estimation  of  Multivariate  Location  and  Shape:  EID,  October,  1994. 

TR  107.  Clifton  D.  Sutton,  Tree  Structured  Density  Estimation,  October,  1994. 

TR  108.  Charles  A.  Jones,  Simulating  a  Multi-target  Acoustic  Array  on  the  Intel  Paragon  (M.S. 
Thesis),  October,  1994. 

TR  109.  Leonard  B.  Hearne  and  Edward  J.  Wegman,  Fast  Multidimensional  Density  Estimation 
based  on  Random-width  Bins,  October,  1994. 


Fast  Multidimensional  Density  Fstimation  based  on  Random-width  Bins 


Leonard  B.  Hearne^  and  Edward  J.  Wegman^ 
Center  for  Computational  Statistics 
George  Mason  University 
Fairfax,  VA  22030 


Abstract 

Histogram-type  density  estimators  have  some 
notable  computational  advantages  over  other  forms  of 
density  estimation  by  virtue  of  the  WARPing 
algorithm.  However,  traditional  fixed-bin-width  have 
less  than  satisfactory  smoothing  properties,  being  too 
coarse  in  regions  of  high  density  and  too  fine  in 
regions  of  low  density.  Scott  (1992)  suggests  the  ASH 
algorithm  as  a  means  of  overcoming  these  problems, 
but  the  ASH  algorithm  is  computationally  intensive 
somewhat  negating  the  benefits  of  WARPing. 
Wegman  (1975)  proposed  a  variable  bin-width 
technique  for  one  dimensional  density  estimators  and 
used  sieve-type  methods  to  show  strong  consistency 
results  that  did  not  depend  on  smoothness  properties 
of  the  underlying  density.  In  this  paper,  we  extend 
this  idea  to  high-dimensional,  variable  bin-width 
meshes.  The  boundaries  of  the  bins  are  determined 
by  a  random  subsampling  of  the  observations.  An 
extension  of  the  WARPing  algorithm  may  still  be 
used  for  fast  computation.  We  give  combinatorial 
arguments  for  calculating  the  number  of  bins  and  also 
the  conditional  expectation  and  variance  of  the 
number  of  observations  per  bin.  Conditional  on  the 
random  hyper-rectangular  tessellation,  we  calculate 
the  maximum  likelihood  density  estimator. 

Introduction 

In  this  paper,  a  density  estimation  method  is 
developed  that  is  computationally  more  tractable  than 


kernel  density  methods,  and  has  better  smoothing 
properties  than  traditional  fixed  binning  methods. 
The  basic  method  is  easy  to  describe  in  one 
dimension.  Randomly  select  a  subset  of  m 

observations  {U*}  from  a  set  of  n  observations  {T}, 
m  <  n,  together  with  the  marl  FI  and  min{Y}.  Order 
the  set  {y*}  in  the  set  |^(*- )j-  ^  random 

width  bins  iB)  can  be  can  be  constructed  using 
adjacent  elements  in  the  set  )!■•  Then  attribute 
the  probability  mass  of  all  observations  in  IF}  to  the 
bins  in  15}.  The  probability  density  on  an  element 
B-  G  iB)  is  the  relative  probability  mass  on  Re¬ 
divided  by  the  length  of  B-,  cf.  Wegman  (1975)  and 
Hearne  and  Wegman  (1991).  There  are  many  ways  to 
generalize  these  results  to  a  d-dimensional  support 
space.  The  generalization  that  we  have  adopted  here 
is  to  define  random-width  d-dimensional  rectangular 
bins  generated  by  a  random  sample  from  the  set  of 
observations. 

Random-width  d-Dimensional  Bin  Tessellation 
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Given  a  set  of  n  observations,  {y},  in  a  d- 
dimensional  Euclidian  space,  let  be  the  minimum 
d-dimensional  rectangular  cover  of  {Yl.  Each 
observation  Yj&iY}  can  be  written  in  the  form 
Yj- =  ^yj-,yj,---,yj  Then  can  be  defined  by 
the  set  of  maximum  and  minimum  values  for  the  d 
coordinate  axes, 

Ai  =  {xe  3?'':  x'  >  mm(y’)  A  x'  <  ma*(  Y')}. 

A  d-dimensional  rectangular  tessellation  of  A^ 
can  be  generated  by  selecting  a  random  subsample  of 
N  observations  {Sff}  from  {YK  For  each  of  the  d 
coordinate  axes  let  be  the  set  of  the 

coordinate  for  all  Y  together  with  maxiY') 

and  mm(y*).  Let  be  the  ordered  set  of  unique 

elements  in  =  card|5( .  A  set  of  one 

dimensional  bins,  {b*},  can  be  generated  for  each  of 
the  d  coordinate  axes  by  adjacent  elements  in  the  set 
and  card{B*}  =  s‘ -  1.  The  d-dimensional 
rectangular  random  tessellation  of  -^n 

be  generated  by  the  cross  product  of  the  sets  of  one 
dimensional  bins  for  each  coordinate  axis; 

=  and 

m  =  card! H  ~  1  )• 

'■  ''  1  =  1 

The  upper  bound  on  the  cardinality  of  the  set  of 
one  dimensional  bins  that  are  generated  for  each  of 
the  coordinate  axes  is  s’  —  1  <  A  -h  1 ,  1  <  i  <  d,  since 
the  random  sample  {5';^}  may  have  observations  that 
contain  max{Y*)  or  mm(Y‘),  observations  are 
recorded  only  to  finite  precision,  and  computers 
operate  on  a  subset  to  the  rational  numbers.  The 
cardinality  of  the  tessellation  fLen  has  an 

upper  bound,  given  the  random  subsample  |5 of 
m  =  card! =  ]^  (s*  —  1 )  <  ( A  -f  1 


In  Figure  1  a  set  of  observations  {Yl  in  3?^  have 
values  max(Y^),  miniY^),  max  (y^),  and  miniy^X 
These  values  define  the  minimum  2-dimensional 
rectangular  cover  of  {Yl.  A  random  subsample  of 
observations  is  drawn  from  {Yl,  {'S'3}  =  (Pi,P2)P3} 
These  three  points  together  with  the  maximum  and 
minimum  values  for  each  of  the  coordinate  axes 
generate  the  set  of  bins  of 

max{Y^) 


mm(Y^) 

min(Y^)  max{y^) 

Figure  1 
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The  tessellation  of  A^  is  adaptive  in  the 

sense  that  the  elements  of  the  tessellation  tend  to  be 
large  where  the  observations  are  sparse  and  small 
where  the  observations  are  not  sparse. 


Conditional  Expectation  and  Variance  of  the  Number 
of  Observations  per  Bin 

Let  Bf.,  I  <k  <m,  be  the  d-dimensional  bin 
in  the  tessellation  of  A^,  and  let  be  the 

number  of  observations  in  {Yl  that  are  in  B/^.  The 
expected  value  of  given  the  tessellation  i® 

the  number  of  observations  that  might  be  attributed 


to  the  fc’"  bin  times  the  probability  that  the  d- 
dimensional  random  variable  X  is  in  the  k*^  bin; 


^Z,\{Bi} 


^in-N)P{X 


Let  U\,  I  <i  <d,  be  the  empirical  probability 

bin, 


mass  on 


the  one  dimensional 

1  <  _;■  <  s*  —  1,  for  the  coordinate  axis, 

e‘i = B)  I  ib')). 


Using  order  statistical  arguments,  c/.  Rohatgi 
(1976)  pp. 575-580,  it  can  be  shown  that; 

I  {b‘}]  =  1  <  i  <  -  1,  and 

Since  the  tessellation  generated  by  the 

cross  product  of  the  one  dimensional  bins  on  each  of 
the  d  coordinate  axes  then  the  probability  mass  that 
is  on  a  given  d-dimensional  bin  Bf,  €  {^n}>  S'^en  the 
tessellation  |^n}) 

and 

Multiplying  by  the  number  of  observations  that  might 
be  attributed  to  a  d-dimensional  rectangular  bin, 
n  —  N,  and  applying  the  inequality  bounding  the 
cardinality  of  the  number  of  bins  in  the  tessellation; 

I  {^n}]  >  1  <  and 

I  fj.d\]^(n-NnN-ir^ 


A  Class  of  Probability  Density  Estimators 

Let  n  be  the  number  of  observations  in  the  set  of 
observations  {FI,  and  let  be  the  number  of 
observations  in  the  rectangular  bin  in  the 

tessellation  Let  W(^N be  the  probabilistic 

mass  of  observations  in  the  tessellation  generating  set 
{Spf]  that  are  attributed  to  an  adjacent  bin  in  the 


tessellation  by  the  function  1F(  •  )•  And 

let  be  the  d-dimensional  content  of  the 
element  of  the  tessellation.  Then  we  can  define  a 
class  of  probability  density  estimators  on  a 
tessellation  by; 


H+n^k) 


n-Ci 


f{xeB,)  = 


and 


0. 


This  class  of  probability  density  estimators  is 
constant  on  each  bin  in  the  tessellation,  and  the 
content  of  each  of  the  d-dimensional  bins  in  the 
tessellation  C/.  is  easily  computed.  The  probabilistic 
mass  attribution  function  W{.  •  )  is  closely  related  to 
the  likelihood  function. 


The  Likelihood  Function 

The  likelihood  function  was  introduced  as  a 
means  for  optimizing  the  parameter  values  in  the 
parametric  density  estimation  setting  so  that  the 
fitted  parametric  function  would  best  fit  a  set  of 
observations.  In  the  nonparametric  setting  the 
likelihood  function  has  utility  if  there  is  a  variable  in 
the  class  of  density  estimators.  The  weight  that  is 
attributed  to  bins  in  the  tessellation  by  observations 
in  {•S'yy}  is  variable  and  can  be  used  to  optimize  the 
likelihood  function. 

The  likelihood  function  for  this  class  of 
probability  density  estimators  is 

A  n,  +  W{N,) 


Ux) 


=  n 

j = 1 


n-C,  ' 

the  product  of  the  density  estimates  for  each  of  the 
observations.  But  the  class  of  density  estimators  that 
are  presented  here  are  estimators  on  the  set  of  bins  in 
the  tessellation  of  so  the  likelihood  function  can  be 
reformulated  in  terms  of  the  elements  of  the 


tessellation; 


k  =  l 


n-C, 


Taking  the  first  derivative  of  the  log  of  the 
likelihood  function  with  respect  to 


jjirrAT  JogL(x)=  ^ 
fc  =  l 


_ ,  n^k)  \ 

r^k  +  n^kV  nk  +  W{N,)) 

+  E^[fo9{nk  +  W{N^.))  -log{n-Ck)y 


If  the  first  derivative  is  set  equal  to  zero  and  solved 
for  W(^Nk)  then  the  estimator  will  be  optimized, 
either  maximized  or  minimized  depending  on  the  sign 
of  the  second  derivative  of  the  log  of  the  likelihood 
function.  Taking  the  second  derivative  of  the  log  of 
the  likelihood  function; 


^  71^ 

dW{Nkf  ~k^irik  +  W{Nk)‘ 

The  second  derivative  of  the  log  of  the  likelihood 

function  with  respect  to  W(^N k)  is  positive  on  all  bins 

in  the  tessellation  that  have  observations  in  them, 

tik  >  0,  and  is  undefined  where  Uk  =  0.  The  likelihood 

function  is  thus  convex  and  the  likelihood  function  is 

maximized  when  the  probabilistic  meiss  of  all 

observations  in  attributed  to  the  adjacent 

bin  where  will  be  largest. 

^k 


A  Random  Bin-width  Warping  Algorithm 

For  the  proposed  probability  density  estimation 
method  to  be  of  utility  it  is  important  that  density 
estimates  be  readily  computable,  given  a  set  of  n 
observations,  {FI,  in  a  d-dimensional  Euclidian  space. 
The  principal  computational  complexity  is  in  the 
attribution  of  observations  to  bins  in  the  tessellation, 
of  the  minimum  d-dimensional  rectangular 


cover  of  {FI,  A^.  In  conventional  fixed  width  binning 
methods  an  algorithm  called  warping  has  been 
developed  that  increases  the  speed  and  reduces  the 
computational  complexity  for  attributing  observations 
to  bins  in  the  tessellation.  This  algorithm  has  been 
extended  to  variable  bin-width  tessellations. 

Given  N  the  number  of  observations  in  the 
random  sample  of  observations  used  to  generate  the 
rectangular  bins  in  the  tessellation,  the  cardinality  of 
the  set  of  bins,  m,  is  bounded  by; 


m 


=  card{B4=  n  + 

For  each  coordinate  axis  there  is  an  upper  bound  on 
the  number  of  one  dimensional  bins  that  can  be 
generated.  Let  Bound_Values[i,  j]  be  a  matrix  with 
the  row,  0  <  i  <  d,  corresponding  to  and 

Bound_Value[i,0]=min(F’).  Then  for  each  row  x, 
0<i  <«’-!•  Let  Bin_lndex[t,  I:]  be  a  matrix  with 
the  i**  row  a  vector  of  integer  indices  into  the  matrix 
Bound_Values[i, j],  with  0<k<w',  where  w*  is  the 
selected  number  of  warping  indices  for  the 
coordinate  axis,  s'  —  1  <  w*. 


Let  6*  =  min{Y')  and  o'  = 


max' 


(F')-mm(F') 


for 


w 


the  coordinate  axis,  0  <  i  <  d.  For  any  point 


*'  e 


mirv 


iniv'),  maxiv') 


Index  =  Truncate 


then  the  value 
(x’-6') 


is  an  integer  in  the  range  0  <  Index  <  w'.  Let  the 
coordinate  axis  and  the  entry  in  the  matrix 
Bin_Index[i,  ik]  be  the  smallest  index  j  into  the 
matrix  Bound_Values[i,  j]  such  that 


a’(lndex  +  b')  <  Bound_Values[i,  j]. 


Then  an  efficient  algorithm  to  compute  the  bin  index 


for  the  coordinate  axis,  0  <i  <  d,  is  shown  in  the 
following  code  fragment. 

Get_Bin_Index(i,  x’) 

Table_Index  =  Truncate((x’  - 
Index  =  Bin_Index[i,  Table_Index] 

While(x’  >  Bound_Values[i,  Index])  Index++ 

Return  Index 

The  size  of  the  number  of  warping  indices,  w*,  is 
specified  by  the  user  of  the  density  estimation 
method.  The  question  of  how  large  w*  should  be  is  of 
interest.  We  want  to  maximize  the  probability  of 
selecting  the  correct  bin  index  on  the  first  attempt  for 
each  of  the  d  coordinate  axes.  The  bounds  on  the 
probability  of  selecting  the  correct  bin  index  on 
the  first  attempt  is; 

P(x‘  <  Bound_Values[i,Bin_Index[i,Table_Index]]) 

^  u;*  —  s*  +  1 
w' 

The  larger  w*  is  relative  to  s‘  —  1,  the  larger  the 
probability  that  the  correct  bin  index  will  be 
computed  on  the  first  attempt.  If  the  density 
function  is  symmetric  then  the  expected  value  of  the 

probability  is - : - . 

w' 

Conclusions  and  Extensions 

Random-width  binning  methods  are  a 
computationally  tractable  alternative  to  fixed-width 
binning  methods.  The  size  of  the  bins  in  a  d- 
dimensional  space  are  adaptive  so  that  the  bins  will 
tend  to  be  large  where  the  observations  are  sparse  and 
small  where  the  observations  are  not  sparse.  Bounds 
on  the  expected  value  and  variance  of  the  number  of 
observations  that  are  attributed  to  each  bin  can  be 


calculated,  given  the  size  of  the  subsample  that  is 
randomly  selected  from  the  set  of  observations  to 
generate  the  d-dimensional  bins.  The  likelihood 
function  is  convex  a  function  that  can  be  maximized 
or  minimize  to  give  a  maximum  entropy  estimate  by 
selecting  the  appropriate  probabilistic  weight 
distribution  function  W(.  • ),  cf.  Hearne  and  Wegman 
(1992).  By  applying  an  extension  to  the  WARPing 
algorithm,  the  computational  complexity  of  the 
random-width  binning  method  is  only  slightly  more 
computationally  intensive  than  fixed-width  binning 
methods. 

One  of  the  natural  extensions  to  random-width 
binning  methods  is  to  apply  a  resampling  scheme,  cf. 
Billard  and  LaPage  (1992).  Given  smoothness 
assumptions  about  the  underlying  probability  density, 
then  the  size  of  the  set  of  observations,  the  dimension 
of  the  observations  space,  and  the  expected  value  and 
variance  bound  on  the  number  of  observations  that 
are  attributed  to  each  bin  might  be  used  to  find  the 
optimal  subsample  size,  and  the  number  of  resampling 
repetitions  necessary  to  achieve  the  desired  density 
estimate  smoothness.  Resampling  in  an  optimal  way 
is  believed  to  be  less  computationally  intensive  than 
either  kernel  or  ASH  methods,  cf.  Scott  (1992). 
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